Poster: 0 Date: Aug 5, 2002 6:13am
Forum: millionbooks Subject: full text search

I've just spoken with Leon Bottou, and I think I understand how we should set things up to allow for a full-text search for books.

We can use bundled versions of the books instead of indirect ones, then serve them through djvuserve so that the pages load quickly.

From the bundled djvu files, there is a way to get OCR'd text files for each page, instead of a single text file for the whole bundled book. One way is "djvutotext -p bundledbook", which creates a separate text file for each page of the book "bundledbook". Leon says that djvused can also be used and might allow for more flexibility.
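
Here is a rough sketch of the djvused route, in case it helps. It assumes DjVuLibre's djvused is on the PATH and that its "n" (page count), "select" and "print-pure-txt" commands do what I expect; the function names and the page_NNNN.txt naming are made up for illustration and are not something Leon specified.

```python
import subprocess
from pathlib import Path


def djvused_cmd(book, script):
    """Build the argv for running a djvused script against a bundled book."""
    return ["djvused", "-e", script, str(book)]


def page_text_script(page):
    """djvused script that selects one page (1-based) and prints its plain text."""
    return f"select {page}; print-pure-txt"


def run_djvused(book, script):
    """Run djvused and return its standard output as text (assumes UTF-8 output)."""
    result = subprocess.run(
        djvused_cmd(book, script),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


def dump_pages(book, out_dir):
    """Write one text file per page of the book; return the files written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # "n" asks djvused for the number of pages in the document.
    page_count = int(run_djvused(book, "n").strip())
    written = []
    for page in range(1, page_count + 1):
        target = out_dir / f"page_{page:04d}.txt"  # hypothetical naming scheme
        target.write_text(run_djvused(book, page_text_script(page)), encoding="utf-8")
        written.append(target)
    return written
```

Calling dump_pages("bundledbook.djvu", "texts/bundledbook") would then leave one file per page under that directory, ready to be indexed.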

Once we have the separate text files for each page of each book, we can run glimpse over the entire collection of text files. This will give us a list of all pages where a keyword occurs. Of course the user doesn't want a bunch of links from the same book, so we'll merge the hits so that we get a list of books where the keyword occurs, along with a list of page #s or at least the first page occurrence. When a user clicks a result link, we can invoke the djvu book using page number arguments in the cgi call.
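
The merge step could look something like the sketch below. It assumes the search output has grep-style lines of the form "path: matched text" and that the page files are laid out as <book>/page_<number>.txt; both the layout and the function name are my assumptions, not anything we've settled on.

```python
import re
from collections import defaultdict

# Hypothetical layout: .../<book>/page_<number>.txt
PAGE_RE = re.compile(r"(?P<book>[^/]+)/page_(?P<page>\d+)\.txt$")


def merge_hits(lines):
    """Collapse per-page hits into {book: sorted list of page numbers}."""
    pages_by_book = defaultdict(set)
    for line in lines:
        # Everything before the first colon is taken to be the file path.
        path = line.split(":", 1)[0].strip()
        match = PAGE_RE.search(path)
        if match is None:
            continue  # skip lines that do not name a page file
        pages_by_book[match.group("book")].add(int(match.group("page")))
    return {book: sorted(pages) for book, pages in pages_by_book.items()}
```

The first entry of each list is the page to jump to; the rest can be shown as extra page links or just dropped.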

This will pretty much do the trick, though it will only jump to the first occurrence. Leon said he'd like to see the reader allow a cgi argument in the form of "find=xxx", where xxx is a keyword, so that the findnext dialog is opened automatically as the book opens to the first occurrence.
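
For the result links, here is a small sketch of how the cgi call might be built. The "book" and "page" parameter names and the example URL are placeholders I made up, and "find" is only the argument Leon proposed; the reader does not support it yet.

```python
from urllib.parse import urlencode


def book_link(base_url, book_id, page=None, find=None):
    """Build a result link; parameter names here are hypothetical."""
    params = {"book": book_id}
    if page is not None:
        params["page"] = page  # page of the first occurrence
    if find:
        params["find"] = find  # proposed keyword argument, not yet supported
    return f"{base_url}?{urlencode(params)}"
```

For example, book_link("http://example.org/cgi-bin/djvuserve", "moby", page=3, find="whale") gives a link that opens the book at page 3 and carries the keyword along for the find dialog.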